Clarified INFO END deprecation status #844

Open
d-cameron wants to merge 1 commit into master from vcf_end_clarification


@d-cameron
Contributor

Addresses concerns raised in #784

@github-actions

github-actions bot commented Sep 9, 2025

Changed PDFs as of 5132c8b: VCFv4.5 (diff).

##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of \begin{environment-name}
Samples
\end{environment-name} With Data">
Member


Unintended \begin{environment-name} … edit here?

@jmarshall jmarshall added the vcf label Sep 9, 2025

It is recommended that VCFv4.5 files include END unless that VCF contains any record that could be misinterpreted by the presence of END.
That is, if there exists a sample or allele in which the END computed for that SVLEN or FORMAT LEN does not equal the maximum END, then no END should be present in any record that VCF.
This approachs maintains backwards compatibility for unproblematic VCFs while attempting to minimise the probability of downstream data errors by making problematic records not valid for earlier versions of VCF (END was required for $<$*$>$ symbolic alleles).
Contributor


"approachs" should be "approaches".

Those same tools will incorrectly interpret the size of the smaller symbolic structural variants and $<$*$>$ symbolic alleles when END is present.

It is recommended that VCFv4.5 files include END unless that VCF contains any record that could be misinterpreted by the presence of END.
That is, if there exists a sample or allele in which the END computed for that SVLEN or FORMAT LEN does not equal the maximum END, then no END should be present in any record that VCF.
Member


I find the current wording confusing. May I suggest rephrasing along the following lines:

  • Clarify that END is a derived field. If it is absent, it can be computed in such and such way.
    (Therefore, not deprecated. Using the term deprecated raises unnecessary doubt: should newly written software still support END? The answer is yes, it must remain supported. So it’s better to avoid language that implies otherwise.)

  • Clarify the handling of inconsistencies. I do not fully understand what the other paragraphs are trying to convey. My interpretation is that they intend to describe what happens if END is computed incorrectly or conflicts with the primary information. Practically speaking, the responsibility lies with the producer to ensure consistency, and each program may choose how to handle discrepancies. If an analysis relies on the END tag, it will not recompute it from the primary fields (otherwise we would not need END in the first place). Conversely, if an analysis works directly from the primary fields, it is expected to ignore END, since END is derived.

  • Clarify the comparison of END and LEN. If a comparison between END and LEN is important, the text should explain explicitly in what ways the two differ and in what ways they are equivalent. Although I am fairly familiar with the VCF format, the current paragraph did not make this distinction clear.
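
As an illustration only, the rule quoted in the diff (omit END from the whole VCF if any record has an allele or sample whose computed END differs from the maximum END) could be sketched as below. The function names are hypothetical, and the per-allele end positions are assumed to have already been computed from SVLEN or FORMAT LEN by the caller; how they are computed is left to the specification.

```python
from typing import Iterable, Optional, Sequence


def record_end_is_unambiguous(allele_ends: Sequence[Optional[int]]) -> bool:
    """Return True if every allele/sample with a known end shares one end.

    `allele_ends` holds the end position computed for each allele or sample
    (None where no end applies). If all known values are equal, they all
    equal the maximum, so a single END cannot misrepresent any of them.
    """
    known = [e for e in allele_ends if e is not None]
    return len(set(known)) <= 1


def should_write_end(records: Iterable[Sequence[Optional[int]]]) -> bool:
    """Whole-file decision: write END only if no record is ambiguous."""
    return all(record_end_is_unambiguous(ends) for ends in records)
```

Under this sketch the decision is made per file, not per record, which matches the "no END should be present in any record" wording of the proposed text.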

Contributor Author

@d-cameron d-cameron Feb 24, 2026


The issue is that if an analysis relies on END and multiple ALT alleles end at different positions, the analysis will be silently wrong. E.g., if a 4.5 VCF has something like POS=10 ALT=<DEL>,<DUP>,<*> SVLEN=10,20;END=40 LEN=.,.30, then the SVs will be interpreted as 30bp in length when they are actually shorter.

END is fine until you have multiple ALTs with different lengths. END is deprecated in the literal sense of it not being the preferred field to use. Should END be written in a fully 4.5-compliant ecosystem? No. It's redundant and unnecessary. Will we ever have a fully 4.5-compliant ecosystem? Also no, hence the wording in this PR around still writing it.

> I do not fully understand what the other paragraphs are trying to convey.

They're conveying that there are 4.5 records that pre-4.5 software relying on END will misinterpret, so its results will silently be incorrect. Writing or not writing END doesn't change the fact that this is a backwards-incompatible change; it just changes what breaks. Would it help if I changed this to a recommendation that, if you want pre-4.5 compatibility, you don't write symbolic SVs in the same VCF record as gVCF <*> blocks?
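
To make the failure mode in the example above concrete, here is a minimal sketch. It assumes a pre-4.5 tool that sizes every symbolic SV in a record as END minus POS; that behaviour, and the function name, are assumptions for illustration and not taken from any particular tool.

```python
from typing import List, Optional, Tuple


def legacy_misread_lengths(
    pos: int, end: int, svlens: List[Optional[int]]
) -> List[Tuple[int, int, int]]:
    """List alleles whose SVLEN disagrees with the END-implied length.

    Returns (allele_index, svlen, end_implied_length) for every allele
    that an END-based reader (assumed: length = END - POS) would mis-size.
    """
    implied = end - pos  # length an END-only reader assigns to every SV
    return [
        (i, svlen, implied)
        for i, svlen in enumerate(svlens)
        if svlen is not None and svlen != implied
    ]


# Values from the example in the comment: POS=10, END=40, SVLEN=10,20
print(legacy_misread_lengths(10, 40, [10, 20]))
# [(0, 10, 30), (1, 20, 30)]
```

Both the 10bp and the 20bp allele are reported as 30bp, which is the silent error described in the comment.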
